Introduction

This report is intended to showcase and log all observations and insights from the projection of our Extended IMD datasets onto the IMD basis.

Background

The Extended IMDbasis project is a follow-up of the IMD basis, which main goal was to develop a method through which we could understand the aetiological relationships among multiple Immune mediated diseases (IMD) using genome-wide association studies (GWAS) summary statistic data. Understanding the shared (and differential) pathogenic patterns across complex diseases may provide useful insights that can help improving therapeutical intervention. More details about the method can be found at our paper Informed dimension reduction of clinically-related genome-wide association summary data characterises cross-trait axes of genetic risk.

This project intends to expand the approach used in the IMD basis by:

  1. Downloading and formatting GWAS summary statistics datasets on more traits, including other IMDs, biomarkers (such as cytokine levels), infectious diseases, response to vaccines, and psychiatric disorders. In doing so, increase the representativity of non-European populations across our datasets.
  2. Projecting the new datasets onto the basis to uncover novel patterns and relationships between different traits, as well as spot possible errors during file processing.
  3. Creating new bases from traits collected that are known to have a crucial importance in disease aetiology and that can be used to project the datasets and obtain new insights from a different perspective (eg. Cytokine/Blood cell count basis).
  4. Developing strategies to accomodate different ethnicities into our basis creation, so we can adjust for natural differences in allele frequencies across world populations.

In this report, we’ll focus on objective 2, by trying to extract insights and spot errors on already-processed files. All files were processed through a custom pipeline that included determining REF (reference, or non-effect allele) and ALT (effect allele) alleles, check that all relevant columns for projections were present, calculation of missing columns when possible, and liftover to hg38. After this, files were filtered by the IMD basis SNP manifest (a list of SNPs used to create the initial IMD basis, proved to be enough to capture most of the biological signal in the datasets), thus retaining only 556 SNPs, and flipping alleles when necessary (this was carried out using Projecting_dbs/Reducing_datasets_to_SNPmanifest.R). Finally, I projected all files and splitted the results in two tables: Projected_table_XXX.tsv and QC_table_XXX.tsv (where XXX stand for the date in which they were created). In addition, I created Extra_info_XXX.tsv, a modified version of Main_table containing useful information for analysis, such as N of samples, ancestry, chip, and extended trait name.

A bit of Exploratory Data analysis

First, let’s load the required libraries and files.

library(data.table)
library(cupcake)
library(pheatmap)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
library(ggplot2)
library(ggrepel)
library(gridExtra)
proj.table <- fread("Projected_table_20200317.tsv")
QC.table <- fread("QC_table_20200317.tsv")
extra.info <- fread("Extra_info_20200317.tsv")
QC.table <- merge.data.table(QC.table, extra.info, all.x = TRUE)

Checking matching SNPs

Let’s take a first glance at our projected datasets by doing some exploratory stats. The first concern is the number of SNPs in the manifest that we found in the files too, as this will have direct consequences for latter analyses.

hist(QC.table$nSNP, main = "Number of matching SNPs (All files)", xlab = "N matching SNPs")

Most of the datasets have a high number of shared SNPs, which is good. There are a number of them that have 300-350 SNPs. I suspect that part of them are ImmunoChip studies, since Immunochip is a targeted array comprising less SNPs than a typical genome-wide array.

hist(QC.table[QC.table$Chip != "ImmunoChip", ]$nSNP, main = "Number of matching SNPs w/o Immunochip", xlab = "N matching SNPs", breaks = 12)
hist(QC.table[QC.table$Chip == "ImmunoChip", ]$nSNP, main = "Number of matching SNPs (Immunochip only)", xlab = "N matching SNPs")

We observe that, although ImmunoChip studies have lower number of matches, they’re not the only ones. Let’s take a look by type of trait

par(mfrow = c(2,2))
hist(QC.table[QC.table$Trait_class == "IMD", ]$nSNP, main = "IMD", xlab = "N matching SNPs")
hist(QC.table[QC.table$Trait_class == "BMK", ]$nSNP, main = "Biomarkers", xlab = "N matching SNPs")
hist(QC.table[QC.table$Trait_class == "INF", ]$nSNP, main = "Infection-related traits", xlab = "N matching SNPs")
hist(QC.table[QC.table$Trait_class == "PSD", ]$nSNP, main = "Psychiatric traits", xlab = "N matching SNPs")

More stats!

summary(QC.table)
##     Trait                nSNP         overall_p           mscomp         
##  Length:470         Min.   :  3.0   Min.   :0.000000   Length:470        
##  Class :character   1st Qu.:556.0   1st Qu.:0.000025   Class :character  
##  Mode  :character   Median :556.0   Median :0.200000   Mode  :character  
##                     Mean   :531.4   Mean   :0.369026                     
##                     3rd Qu.:565.0   3rd Qu.:0.725000                     
##                     Max.   :568.0   Max.   :1.000000                     
##                                     NA's   :2                            
##    File_ID           Trait_long        Trait_class              N0          
##  Length:470         Length:470         Length:470         Min.   :    15.0  
##  Class :character   Class :character   Class :character   1st Qu.:   940.2  
##  Mode  :character   Mode  :character   Mode  :character   Median : 20209.5  
##                                                           Mean   : 57392.0  
##                                                           3rd Qu.: 94292.2  
##                                                           Max.   :824006.0  
##                                                                             
##        N1                 N            SNP_number           Chip          
##  Min.   :     0.0   Min.   :   165   Min.   :   10000   Length:470        
##  1st Qu.:     0.0   1st Qu.:  1000   1st Qu.: 8958989   Class :character  
##  Median :   230.5   Median : 33167   Median :15581317   Mode  :character  
##  Mean   :  4934.5   Mean   : 62339   Mean   :13103860                     
##  3rd Qu.:  2599.5   3rd Qu.: 96377   3rd Qu.:16147927                     
##  Max.   :107769.0   Max.   :898130   Max.   :29504835                     
##                                      NA's   :49                           
##   Population       
##  Length:470        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
QC.table[is.na(QC.table$overall_p),]

We see that of all 468 studies, 2 of them failed to be projected due to low SNP count.

These two files are special. MDD_Wray is a Meta-analysis, containing data from multiple studies. This particular file contained 23andMe data, which due to privacy reasons, they only made public 10k SNPs, which explains the low match. Fortunately, we have another file of the same meta-analysis, including many more SNPs.

Regarding T2D_Gaulton, a MetaboChip was used for genotyping, which contains ~46k SNPs. Being a targeted array with low number of SNPs genotyped, it makes sense that only 11 matched.

For the projected ones, we see how many, and what proportion of all files have less than 95%, 80%, and 50% of matching SNPs, respectively

QC.table <- QC.table[!is.na(QC.table$overall_p),]
c(lessthan95 = nrow(QC.table[QC.table$nSNP < 556*.95,]), lessthan80 = nrow(QC.table[QC.table$nSNP < 556*.8,]), lessthan50 = nrow(QC.table[QC.table$nSNP < 556*.5,]))
## lessthan95 lessthan80 lessthan50 
##         63         42          6
c(lessthan95 = nrow(QC.table[QC.table$nSNP < 556*.95,])/nrow(QC.table), lessthan80 = nrow(QC.table[QC.table$nSNP < max(QC.table$nSNP)*.8,])/nrow(QC.table), lessthan50 = nrow(QC.table[QC.table$nSNP < max(QC.table$nSNP)*.5,])/nrow(QC.table))
## lessthan95 lessthan80 lessthan50 
## 0.13461538 0.08974359 0.01282051

For the following analyses, we’ll focus on files that came from a Genome-wide array (ie. not Immunochip), and have at least 80% of matching SNPs. This will guarantee a certain degree of confidence in our observations, as well as statistical consistency (Statistical terminology to be improved/made more accurate!).

FT80 <- QC.table[QC.table$nSNP/max(QC.table$nSNP) >= .8 & !grepl("ImmunoChip",QC.table$Chip, ignore.case = TRUE), ]
basis.traits <- c("Asthma", "^CD$", "^CEL$", "IgA_Neph", "^LADA$", "^MS$","^PBC$","^PSC$","^RA$", "^SLE$", "^T1D$","^UC$", "^VIT$") # So we keep basis traits too
FT80.proj.table <- proj.table[grepl(paste(c(FT80$Trait, basis.traits), collapse = "|"), proj.table$Trait),]
FT80.proj.table <- merge(FT80.proj.table, FT80[,c("Trait", "Trait_class")], by = "Trait", all.x = TRUE) # So we don't lose original basis traits
FT80.proj.table[is.na(FT80.proj.table$Trait_class), ]$Trait_class <- "IMD" # Since original basis traits are all IMD
Less80 <- QC.table[QC.table$nSNP/max(QC.table$nSNP) < .8, ]
Less80.proj.table <- proj.table[grepl(paste(Less80$Trait, collapse = "|"), proj.table$Trait),]

FDR

Since we projected many studies, the resulting PC P-values should be adjusted for family-wise error rates. For that we’ll compute the FDR by trait class (IMD, BMK, and INF), and by PC (PC1,…,PC13), and apply a FDR threshold of 1%. PCs with significant differences from control (\(\delta\)) in either direction will be marked with ’*’ in the following heatmaps.

FT80.proj.table[, FDR.PC.trait:=p.adjust(P, method = "BH"), by=c("Trait_class", "PC")]
FT80.proj.table[, stars:=ifelse(!is.na(FDR.PC.trait) & FDR.PC.trait<0.01,"*","")]


Quality control

Exploring specific SNPs in Case-controls

A good QC measure is to check whether SNPs in case-control studies have their SE proportional to sample size. In order to do this, we’ll first select case-control studies, and see what are the most popular SNPs.

QC.casecontrols <- FT80[FT80$N1 > 0,]
main_df <- data.frame()
for(i in QC.casecontrols$Trait){
  x <- fread(paste("Filtered_datasets/",i,"-ft.tsv", sep = ""))
  x$filename <- i
  main_df <- rbind(main_df, x, fill= T)
}
filtered_df <- main_df[main_df$P < 1e-5, ]
mainSNPS <- as.data.table(table(filtered_df$SNPID))
manifest <- fread("Manifest_build_translator.tsv")
manifest[,pid19:=paste(CHR19, BP19, sep = ":")]
manifest <- manifest[,c(1,8)]
manifest <- merge(manifest, copy(cupcake::SNP.manifest), by.x = "pid19", by.y = "pid")
mainSNPS <- merge(mainSNPS, manifest[,c(1:4,6)], by.x = "V1", by.y="SNPID")
names(mainSNPS)[4:5] <- c("REF", "ALT")
setorder(mainSNPS, -N, ld.block)
head(mainSNPS, n = 50)
hist(mainSNPS$N, breaks = 15)

QC.N0N1.table <- merge(main_df, QC.table[,c(1, 8:10)], by.x = "filename", by.y = "Trait")
QC.N0N1.table$Z <- QC.N0N1.table$BETA/QC.N0N1.table$SE
QC.N0N1.table$N0N1Calc <- (QC.N0N1.table$N0 + QC.N0N1.table$N1)/(as.numeric(QC.N0N1.table$N0) * as.numeric(QC.N0N1.table$N1))

From this table we can select a number of SNPs to check in greater depth. For example:

rs2476601

rs2476601 is located in the PTPN22 gene and also known as R620W, C1858T, or 1858C>T, may influence risk for multiple autoimmune diseases, such as Rheumatoid Arthritis, Type-1 diabetes, Autoimmune thyroiditis, and Systemic lupus erythematosus.

rs2476601seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()
rs2476601zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
  geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs2476601",], abs(Z) > 4), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(yintercept = 4, linetype = "dashed", colour = "red")+
  theme_classic()
grid.arrange(rs2476601seplot,rs2476601zplot, nrow=2)

We see a clear outlier - MDD_Howard_29662059_1. This dataset has an outstandingly low SE for what we would expect by its sample size. This GWAS was made on UKBB samples. Other outliers in this and other SNPs include UTI,

We also see in this and subsequent plots that Tian_28928442 datasets also have less SE than expected. These GWASs used 23 and Me data.

Regarding the Vulcano plot, we see what we would expect: A allele (left) confers increased risk for several IMDs, such as RA, T1D and VIT, whereas G allele (right) confers increased risk for Crohn’s disease, as well as elevated levels of certain biomarkers (White blood cells, granulucytes, Neutrophils, Eosinophils, etc.), not considered in this case-control analysis.

rs8067378

rs8067378 is associated with childhood asthma.

rs8067378seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()

rs8067378zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
  geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs8067378",], abs(Z) > 5), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept =5), linetype = "dashed", colour = "red")+
  theme_classic()

grid.arrange(rs8067378seplot,rs8067378zplot, nrow=2)

rs7130588

rs7130588 is a regulatory region variant, associated with asthma and atopic dermatitis.

rs7130588seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], abs(BETA) > 2.5 | log(SE) < -5 | log(SE) > -1))+
    theme_classic()
rs7130588zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7130588",], abs(Z) > 4), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
  theme_classic()
grid.arrange(rs7130588seplot,rs7130588zplot, nrow=2)

rs11065987

rs11065987 is an intergenic variant, associated with platelet count, total cholesterol, and others. Ensembl, SNPedia.

rs11065987seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()

rs11065987zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs11065987",], abs(Z) > 4), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
  theme_classic()
grid.arrange(rs11065987seplot,rs11065987zplot, nrow=2)

rs12924729

rs12924729 is an intron variant, associated with risk to PBC (G allele). Ensembl.

rs12924729seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()

rs12924729zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs12924729",], abs(Z) > 4.5), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept = 4.5), linetype = "dashed", colour = "red")+
  theme_classic()

grid.arrange(rs12924729seplot,rs12924729zplot, nrow=2)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).

rs1801274

rs1801274 is a missense variant in the Fc fragment of IgG, low affinity IIa, receptor (CD32) FCGR2A gene. rs1801274(C) encodes the arginine (R) allele, with the (T) allele encoding the variant histidine (H). FcgR isoforms expressed on immune system cells have been linked to the pathogenic consequences triggered by autoantibodies or immune complexes in autoimmune diseases such as rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE), as well as to the efficacy of some immunotherapeutic treatments such as rituximab.

rs1801274seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()
rs1801274zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
  geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1801274",], abs(Z) > 4), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
  theme_classic()
grid.arrange(rs1801274seplot,rs1801274zplot, nrow=2)
## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

rs7574865

rs7574865, a SNP in the third intron of the STAT4 gene, has been reported in a large study of Swedes to be associated with both rheumatoid arthritis (RA) and lupus (SLE). Among other studies, it has been confirmed in a meta-analysis of 8 studies totaling 7,381 patients and over 10,000 controls from both European and Asian populations. The risk allele (oriented to the dbSNP entry) is (T); the odds ratio associated the presence of a risk allele was 1.3 for rheumatoid arthritis and 1.55 for lupus (SLE). SNPedia

rs7574865seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()

rs7574865zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
  geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs7574865",], abs(Z) > 4), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
  theme_classic()
grid.arrange(rs7574865seplot,rs7574865zplot, nrow=2)

rs1050152

rs1050152 is a SNP in the SLC22A4 gene known as L503F, it has been associated with an autoimmune disease, specifically Crohn’s disease, being T the risk allele. SNPedia.

rs1050152seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()

rs1050152zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
  geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1050152",], abs(Z) > 4), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept = 4), linetype = "dashed", colour = "red")+
  theme_classic()
grid.arrange(rs1050152seplot,rs1050152zplot, nrow=2)
## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

rs1893217

rs1893217 is a SNP in PTPN2, linked to Crohn’s disease and type-1 diabetes. SNPedia.

rs1893217seplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], aes(x = log(N0N1Calc), y = log(SE), colour = BETA, label = filename)) +
    geom_point()+
    geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], abs(BETA) > 2.5 | log(SE) < -5))+
    theme_classic()

rs1893217zplot <- ggplot(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], aes(x = BETA, y = abs(Z), label = filename)) +
  geom_point()+
  geom_text_repel(data = subset(QC.N0N1.table[QC.N0N1.table$SNPID == "rs1893217",], abs(Z) > 2), size = 3)+
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red")+
  geom_hline(aes(yintercept = 2), linetype = "dashed", colour = "red")+
  theme_classic()
grid.arrange(rs1893217seplot,rs1893217zplot, nrow=2)

Visualizing the projections

Overall significant traits

To begin with, we want to take a look at traits that are significantly different from \(\delta\) overall.

hmcol <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7", "#F7F7F7", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))(200)
PCorder <-  c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13")
FT80overall <- FT80
FT80overall[, FDR.overall:=p.adjust(overall_p, method = "BH"), by = "Trait_class"]
FT80overall <- FT80overall[FT80overall$FDR.overall < 0.05, ]
FT80overall.proj.table <- FT80.proj.table[grepl(paste(FT80overall$Trait, collapse = "|^"), FT80.proj.table$Trait), ]
Moverall <- acast(FT80overall.proj.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
Moverallstars <- acast(FT80overall.proj.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Moverall <- Moverall[,PCorder]
Moverallstars <- Moverallstars[,PCorder]
Moverall_col <- data.frame(FT80overall[grepl(paste(row.names(Moverall), collapse = "|"), FT80overall$Trait),], stringsAsFactors = F)
row.names(Moverall_col) <- Moverall_col$Trait
Moverall_col <- Moverall_col["Trait_class"]

pheatmap(Moverall, cluster_cols = FALSE, display_numbers = Moverallstars, fontsize_number = 11, color = hmcol, annotation_row = Moverall_col)

*** Observations

  • CD/UC/IBD cluster: There is a clear IBD-related cluster at the top of the heatmap, as we observed previously (although not as clear-cut in IBD subtypes as before). PSC (Primary Sclerosing Cholangitis) groups in this cluster, as we previously saw. In addition, related traits from FinnGen like ulcerative enterocholitis (UEC) and ulcerative ileocolitis (UILC) group within this cluster, as we’d expect. IBD FinnGen and UC Brant show weaker signal than their IBD/UC counterparts, which may explain why they’re apart from them. In the case of Brant, that GWAS was performed in African Americans. PC1 and PC3 are strong in the whole cluster, pointing in the same direction, whereas PC10 seems to split appart CD and UC, leaving IBD somewhere in the middle.

  • For FinnGen, despite being Europeans, Finnish are a very particular population, shaped by low effective sample sizes and numerous bottlenecks across their populational history, which sets them apart from the rest of European populations. This may play an important role on the distribution of FinnGen studies in these and other projection heatmaps, and we’ll surely need to take it into account in the future.

  • NMO and SSC: Neuromyelitis Optica and IgG+ NMO group together with Systemic sclerosis, showing a very similar projection pattern across all 13 PCs. NMO was previously considered a subtype of MS, which is coherent with its proximity in the clustering, and with SSc (although SSc affects different tissues, like connective tissue in the skin and internal organs, but not nervous tissue, apparently). NMO-SSc relationship has been mencioned a few times in the literature, with at least two cases of patients diagnosed for IgG+ NMO and later rediagnosed as SSC positive (Franciotta et al., 2011; Deeb et al., 2019). It also groups with European RA (Okada 1 and 3) but not with the 3 datasets in Japanese RA (Okada2, Kubo, and Ishigaki). SSC is also a rheumatic disease, which may help explain these relationships. We observe further down a couple of RA FinnGen datasets that group with SJOS and Still disease (SDAO), this lack of grouping with other European RA datasets might be explained by what I stated above. Also, NMO groups with SLE and Myastenia gravis, with which it has been previously associated. However, other associated diseases such as Sjögren’s syndrome (SJOS) and sarcoidosis (SARC), look to be further apart. A review from 2015 on NMO overlap with other diseases: Freitas & Guimaraes, 2015.

  • Arteritis (GCA, TAR) cluster: Although not significant for any PC, we observe a small cluster containing the three datasets we have on arteritis (Giant cell arteritis (GCA), Giant cell arteritis with polymyalgia rheumatica (GCAPMR), and temporal arteritis (TAR)).

  • Coeliac disease: Coeliac disease is on its own, as we saw earlier. There are two separated CEL datasets, the FinnGen one (N1 = 800) showing very weak signal.

  • Asthma/Allergic rhinitis cluster: Asthma in many of its forms groups with other allergic disorders, such as Allergic rhinitis (ALR), Allergic pollinosis (ALPOL), allergic asthma (ASTAL), childhood allergy (ALLCO)…and nasal polyps (NASP). This cluster seem to be characterized by a moderately strong signal at PC13, which we previously saw was related to eosinophil levels. This cluster doesn’t group with Eosinophil-related datasets, which have a much weaker signal at PC13. This is probably due to the differences in the nature of the trait (quantitative v. qualitative).

TBC

Basis traits

Since 215 traits is a lot to take a look at the same time, we’ll start by looking at a specific set of traits. Specifically, those present in the original basis, so we can have an overview of how new studies behave when projected onto a basis made from the same traits, and also to spot possible errors and unexpected behaviour in files. Note that some of the files included here are the same that were used to build the basis, so we can expect some exact matches.

We’ll start by filtering our dataframe by the traits in the basis:

all.basis.traits <- c("Asthma", "AST","^CD(_|$)", "CEL", "IgA_Neph", "IGAN", "LADA", "^MS(_|$)","PBC","PSC","^RA(_|$)", "^SLE(_|$)", "T1D","^UC(_|$)", "VIT(_|$)")
basis.table <- FT80.proj.table[grepl(paste(all.basis.traits, collapse = "|"), FT80.proj.table$Trait),]
Mbasis <- acast(basis.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
Mstars <- acast(basis.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Mbasis <- Mbasis[,PCorder]
Mstars <- Mstars[,PCorder]
pheatmap(Mbasis, cluster_cols = FALSE, display_numbers = Mstars, fontsize_number = 12, color = hmcol)

We observe several things. In general, studies tend to group by their trait, with Crohn’s disease and Ulcerative colitis grouping close together (and with PSC, as we previously saw). However we see some interesting stuff:

  • T1D_FinnGen Didn’t group with the basis’ T1D trait, despite having visually most PCs in the same direction (exceptions are PC10 and PC11, both not statistically significant).
  • ASTEOS_FinnGen don’t group where we would expect them to. This trait is “Suspicious of Eosinophilic Asthma”. Unfortunately, we don’t have any eosinophilic asthma dataset, but the low confidence in the diagnostic of the trait may render this result less trustworthy.
  • MSC_FinnGen groups properly with other MS datasets, however PC1 seems to be negative (-0.018), compared to its positive counterparts, although it’s not significant (FDR > 0.33). The rest of PCs seem to agree. NB: This GWAS has a relatively low case N (378).
  • Interestingly, we see that despite Asthma traits in FinnGen generally group with asthma phenotypes from other studies, ASTCO (Childhood-onset asthma) datasets do not group together (in fact, Finnish ASTCO groups closer to Ferreira ASTAO [Adult-onset asthma] than to ASTCO). For FinnGen, ASTCO patients were defined as <16 yo, for <19 in Ferreira study. In addition, number of cases in FinnGen is way lower (N1 = 992) than Ferreira (N1 = 13962). Both projections agree in all but PC11, which is significant for Ferreira (FDR < 1e-22) but not for FinnGen (FDR = 0.843).

All IMD

Now we see that the basis works well for the same traits used to build it, let’s expand our view to all IMD diseases.

IMD <- FT80[FT80$Trait_class == "IMD",]$Trait
IMD <- paste(IMD, collapse = "|")
IMD.table <- FT80.proj.table[grepl(IMD, FT80.proj.table$Trait),]
MIMD <- acast(IMD.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MIMDstars <- acast(IMD.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MIMD <-  MIMD[,PCorder]
MIMDstars <- MIMDstars[,PCorder]
pheatmap(MIMD, cluster_cols = FALSE, display_numbers = MIMDstars, fontsize_number = 10, color = hmcol)

All IMD (Europeans only)

As the basis was trained in European populations, populations with other backgrounds are expected to show differences in strength signal, possibly having an influence on clustering. We’ll take a look at datasets restircted to contain Europeans.

IMDeur <- FT80[FT80$Trait_class == "IMD" & grepl("*European*", FT80$Population),]$Trait
IMDeur.table <- FT80.proj.table[grep(paste(IMDeur, collapse = "|^"), FT80.proj.table$Trait),]
MIMDeur <- acast(IMDeur.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MIMDeurstars <- acast(IMDeur.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MIMDeur <-  MIMDeur[,PCorder]
MIMDeurstars <- MIMDeurstars[,PCorder]
pheatmap(MIMDeur, cluster_cols = FALSE, display_numbers = MIMDeurstars, fontsize_number = 10, color = hmcol)

  • Adult Still disease (SDAO) is a rare disease that causes high fevers, rash, and joint pain. It may lead to long-term (chronic) arthritis. It is considered an adult version of JIA, and it groups with Finngen RA and JIA, although the latter doesn’t have any significant PC.

Biomarkers

One type of study we are paying special attention are those of immune biomarker levels, by their role in defence against infections and in autoimmunity. We remove all H2P2 project datasets for now, since they deserve a special look on their own

bmks <- FT80[FT80$Trait_class == "BMK", ]$Trait
bmks <- bmks[grep("^HP", bmks, invert = TRUE)]
bmks.table <- FT80.proj.table[grepl(paste(bmks, collapse = "|"), FT80.proj.table$Trait),]
Mbmk <-  acast(bmks.table[,c(1,2,3)], Trait ~ PC)
## Using Delta as value column: use value.var to override.
Mbmkstars <- acast(bmks.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Mbmk <-  Mbmk[,PCorder]
Mbmkstars <- Mbmkstars[,PCorder]
pheatmap(Mbmk, cluster_cols = FALSE, display_numbers = Mbmkstars, fontsize_number = 10, color = hmcol)

  • We still see a certain degree of within-study clustering. Traits such as Eosinphil count (EOSC), present in Kanai and Astle appear futher away. An example of the opposite is Platelet count (PLAC), which groups in Kanai and Astle.
  • MIG and IP-10 appear together, which is expected as they’re functionally related, and share a receptor (CXCR3).
  • Two traits from O’Connor appear together and show similar signals: Haemophilus influenzae type b polyribosylribitol phosphate (PRP)-specific IgG concentrations (HIGC), and Tetanus toxoid-specific IgG concentrations (TIGGC), which makes sense, being IgG-related.

Biomarkers (European)

Just like we did before, let’s take a look only at European datasets

bmkeurs <- FT80[FT80$Trait_class == "BMK" & grepl("European", FT80$Population), ]$Trait
bmkeurs <- bmkeurs[grep("^HP", bmkeurs, invert = TRUE)]
bmkeurs.table <- FT80.proj.table[grepl(paste(bmkeurs, collapse = "|"), FT80.proj.table$Trait),]
Mbmkeur <-  acast(bmkeurs.table[,c(1,2,3)], Trait ~ PC)
## Using Delta as value column: use value.var to override.
Mbmkeurstars <- acast(bmkeurs.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
Mbmkeur <-  Mbmkeur[,PCorder]
Mbmkeurstars <- Mbmkeurstars[,PCorder]
pheatmap(Mbmkeur, cluster_cols = FALSE, display_numbers = Mbmkeurstars, fontsize_number = 10, color = hmcol)

IMD + Biomarkers (European only)

We’ll now project IMDs together with biomarkers, to see how they cluster together.

MIMDbmkeur <- rbind(Mbmkeur, MIMDeur)
MIMDbmkeurstars <- rbind(Mbmkeurstars, MIMDeurstars)
MIMDbmkeur <-  MIMDbmkeur[,PCorder]
MIMDbmkeurstars <- MIMDbmkeurstars[,PCorder]
MIMDbmkeur_col <- data.frame(FT80[grepl(paste(row.names(MIMDbmkeur), collapse = "|"), FT80$Trait),], stringsAsFactors = F)
row.names(MIMDbmkeur_col) <- MIMDbmkeur_col$Trait
MIMDbmkeur_col <- MIMDbmkeur_col["Trait_class"]

pheatmap(MIMDbmkeur, cluster_cols = FALSE, display_numbers = MIMDbmkeurstars, fontsize_number = 10, color = hmcol, annotation_row = MIMDbmkeur_col)

Most biomarkers cluster together, rather than with IMD traits. This may be due to a more subtle effect size for these biomarker trait, compared with those of IMDs, as we’ve previously seen. Some interesting things:

  • Some eosinophil-related biomarkers (eg. EOBAC, EOPG, EOSC and EOPL) group with Asthma and other allergic traits (Chronic obstructive pulmonary disease [COPD], Rhinitis and/or ezcema [RHEC], Atopic dermatitis [ATD], and Allergic conjuntivitis [ALCON]).

Infectious traits

We also collected data on infection-related traits, to analyze the other side of the coin.

INF <- FT80[FT80$Trait_class == "INF",]$Trait
INF.table <- FT80.proj.table[grepl(paste(INF, collapse = "|"), FT80.proj.table$Trait),]
MINF <- acast(INF.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MINFstars <- acast(INF.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MINF <-  MINF[,PCorder]
MINFstars <- MINFstars[,PCorder]
pheatmap(MINF, cluster_cols = FALSE, display_numbers = MINFstars, fontsize_number = 10, color = hmcol)

  • Not many significant PCs. Not many clear patterns.
  • Some respiratory infections (AST and COPD related infections, as well as pneumonia, but not TB) seem to group together.

Psychiatric diseases

Although the basis was built using IMDs, we want to explore its capacity to identify the relationships between other types of diseases, such as psychiatric diseases.

PSD <- FT80[FT80$Trait_class == "PSD",]$Trait
PSD.table <- FT80.proj.table[grepl(paste(PSD, collapse = "|"), FT80.proj.table$Trait),]
MPSD <- acast(PSD.table[,c(1,2,3)], Trait ~ PC) # PC, Trait, and Delta columns only
## Using Delta as value column: use value.var to override.
MPSDstars <- acast(PSD.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MPSD <-  MPSD[,PCorder]
MPSDstars <- MPSDstars[,PCorder]
pheatmap(MPSD, cluster_cols = FALSE, display_numbers = MPSDstars, fontsize_number = 10, color = hmcol)

Despite not having plenty of significant PCs, it is surprising the degree of clustering that we observe within disease (See AD, Alzheimer’s, except for Early-Onset, MDD [Major depression disorder], BPD [Bipolar disorder], or SCZ [Schizophrenia]) or among related diseases (eg, dementia in AD groups with AD). Of course, the clustering it’s not pergect, and we see several mismatches (BPD_Hou, which is in Europeans, as all the other BPD, and has a big N1 [7467]), as well as big discordances among some studies on the same trait (see Obsessive-compulsive disorder).


H2P2 Project

One of the most interesting group of datasets that we have is the one produced by the H2P2 Project (Wang et al., 2018). In this study, Wang and colleagues built a catalog of cellular genome-wide association studies (GWAS) comprising 79 infection-related phenotypes in response to 8 pathogens (non-typhi Salmonella, S.typhi, Chlamidia trichromatis, Staphylococcus aureus, Candida albicans, Mucor circinelloides, and Toxoplasma gondii) in 528 lymphoblastoid cell lines.

H2P2 <- FT80[grepl("HP[0-9]", FT80$Trait), ]
H2P2.table <- FT80.proj.table[grepl(paste(H2P2$Trait, collapse = "|"), FT80.proj.table$Trait),]
MH2P2 <-  acast(H2P2.table[,c(1,2,3)], Trait ~ PC)
## Using Delta as value column: use value.var to override.
MH2P2stars <- acast(H2P2.table[,c(1,2,9)], Trait ~ PC)
## Using stars as value column: use value.var to override.
MH2P2 <-  MH2P2[,PCorder]
MH2P2stars <- MH2P2stars[,PCorder]
pheatmap(MH2P2, cluster_cols = FALSE, display_numbers = MH2P2stars, fontsize_number = 10, color = hmcol)

We first projected these datasets using EMP_Beta, and we observed huge differences in signals, owing to differences in Beta scale across different datasets. I attempted to correct this by adjusting all datasets by the populational variance in each study, but as we see we still have problems, as no clear signal is distinguishable, and it looks very similar to where we started. This may be due to a number of things (eg. “over”-adjustment of certain files, as for some the factor was very small, leading to very big betas). We should carefully check!